The main goal of this project is to understand the effectiveness of marketing campaigns, based on customers’ coupon redemptions, by evaluating the relationship between their demographic information and their transactions. The data set was acquired from the link below: data
The data set consists of 7 tables: Campaign Data, Item Data, Train, Final Train, Coupon Item Mapping, Customer Demographics, and Customer Transaction Data. Mainly, it contains the following information regarding the customers.
Below you may find the libraries required for the R code (you can click the Code button on the right to unhide the code):
# install any required packages that are not already present, then load them
pti <- c("sqldf","dplyr","tidyverse","plotly","ggplot2","ggpubr","kableExtra","lubridate")
pti <- pti[!(pti %in% installed.packages())]
if(length(pti)>0){
install.packages(pti)
}
library(sqldf)
library(dplyr)
library(tidyverse)
library(plotly)
library(ggplot2)
library(ggpubr)
library(kableExtra)
library(lubridate)
The code chunk below downloads the data from GitHub and prepares it by cleaning and converting the date columns.
githubURL <- "https://github.com/erenaltunn/Data/blob/main/BDA_Project_Data.RData?raw=true"
load(url(githubURL))
customer_transaction_data$date <- as.Date(customer_transaction_data$date, format = "%Y-%m-%d")
# campaign dates arrive as "dd/mm/yy"; insert "20" to pad the year to "dd/mm/20yy" before parsing
campaign_data$start_date <- as.Date(sub("(.{6})(.*)", "\\120\\2", campaign_data$start_date), format = "%d/%m/%Y")
campaign_data$end_date <- as.Date(sub("(.{6})(.*)", "\\120\\2", campaign_data$end_date), format = "%d/%m/%Y")
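The year-padding regex is compact, so here is a minimal sketch of what it does on a sample value (the date string is hypothetical, for illustration only):

```r
# "(.{6})(.*)" captures the first six characters ("dd/mm/") and the rest ("yy");
# the replacement "\\120\\2" re-emits group 1, a literal "20", then group 2.
raw   <- "13/05/12"                            # hypothetical raw campaign date
fixed <- sub("(.{6})(.*)", "\\120\\2", raw)    # "13/05/2012"
as.Date(fixed, format = "%d/%m/%Y")            # now parses cleanly as a Date
```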
We analysed the products with the highest numbers of transactions. To perform this analysis, the customer_transaction_data table was joined with the item_data table; the join attached each product's category, from which the total quantity, total turnover and average discount were computed per category.
The resulting tables, showing the categories with the highest total quantity, the highest total turnover and the highest average discount, respectively, are given below.
join_item_tran <- sqldf("
SELECT aa.*, bb.brand, bb.brand_type, bb.category
FROM
(customer_transaction_data) aa
LEFT JOIN
(item_data) bb ON aa.item_id = bb.item_id")
The table below is ordered by the quantity column and lists the top 10 categories by quantities sold. Fuel naturally has the highest total quantity, as it is sold in liters. These days people are becoming more health-conscious and prefer packaged meat over regular meat; the presence of the Natural Products category also suggests that people care about their health.
kbl(sqldf("SELECT Category AS Category,sum(quantity)as Total_Quantity
FROM join_item_tran
GROUP BY category
ORDER BY sum(quantity) DESC
LIMIT 10"),caption="Quantities Sold by Category")%>%
kable_classic(full_width = F, html_font = "Cambria")
| Category | Total_Quantity |
|---|---|
| Fuel | 149824927 |
| Miscellaneous | 21543590 |
| Grocery | 1244334 |
| Pharmaceutical | 189774 |
| Packaged Meat | 86562 |
| Natural Products | 66454 |
| Meat | 30585 |
| Dairy, Juices & Snacks | 27552 |
| Bakery | 25075 |
| Prepared Food | 14130 |
The table below is ordered by the total turnover column. Broadly, categories with higher turnover tend to show a lower average other discount (these discounts are not associated with personal campaigns).
kbl(sqldf("SELECT Category AS Category, sum(selling_price) as Total_Turnover,
ROUND((AVG(other_discount)*-1),1) as Average_Other_Discount
FROM join_item_tran
GROUP BY category
ORDER BY sum(selling_price) DESC
LIMIT 10"),caption="Total Turnover by Category")%>%
kable_classic(full_width = F, html_font = "Cambria")
| Category | Total_Turnover | Average_Other_Discount |
|---|---|---|
| Grocery | 86229623 | 17.5 |
| Pharmaceutical | 22871322 | 11.7 |
| Fuel | 12838240 | 30.6 |
| Packaged Meat | 8891032 | 35.5 |
| Meat | 4669631 | 39.8 |
| Natural Products | 4604165 | 7.6 |
| Dairy, Juices & Snacks | 2191815 | 12.7 |
| Miscellaneous | 2169650 | 21.5 |
| Bakery | 2148925 | 10.0 |
| Prepared Food | 1911024 | 10.5 |
The table below is ordered by the average other discount column. Seafood, Meat and Packaged Meat have the highest discount values; as these foods spoil quickly, stores tend to discount them more heavily.
kbl(
sqldf("
SELECT Category AS Category,
ROUND((AVG(other_discount)*-1),1) as Average_Other_Discount
FROM join_item_tran
GROUP BY category
ORDER BY avg(other_discount)
LIMIT 10"),caption="Average Discount by Category")%>%
kable_classic(full_width = F, html_font = "Cambria")
| Category | Average_Other_Discount |
|---|---|
| Seafood | 51.1 |
| Meat | 39.8 |
| Packaged Meat | 35.5 |
| Fuel | 30.6 |
| Garden | 30.3 |
| Skin & Hair Care | 29.5 |
| Miscellaneous | 21.5 |
| Grocery | 17.5 |
| Dairy, Juices & Snacks | 12.7 |
| Pharmaceutical | 11.7 |
A study was carried out to trace the historical development of the prices of products sold in high quantities.
First, unit prices were computed on a daily basis. Next, the 9 products with the highest total sales quantities were identified. Finally, these two tables were joined to obtain the daily selling prices of those top 9 products.
In the first plot below we can see the daily price changes of the top 9 products.
daily_unit_price <- sqldf(
"SELECT aa.*
FROM
(SELECT
date AS date,
item_id,
AVG(selling_price/quantity) as daily_unit_price
FROM join_item_tran
GROUP BY date, item_id) aa
INNER JOIN
(SELECT item_id, SUM(quantity)
FROM join_item_tran
GROUP BY item_id
ORDER BY SUM(quantity) desc
LIMIT 9) bb
ON aa.item_id = bb.item_id")
ggplot(daily_unit_price, aes(x=date, y=daily_unit_price)) +
geom_line(color="#1A76FF") + # constant colour goes outside aes() so it is applied literally
facet_wrap(~ item_id) +
geom_smooth(color="#37536d", size=0.25)+
theme_minimal()+ theme(legend.position = "none")+
theme(axis.text.x = element_text(angle = 45))+
labs(title = "Time Series Plot for Daily Unit Price")+
xlab("Date")+
ylab("Daily Unit Price")
However, when the relevant data is examined, we see that some products have no sales on certain dates. Therefore, to see the full price movements, we selected 3 products with complete series and examined their price development. Item IDs 48973 and 49004 belong to the local Miscellaneous category, while Item ID 49009 is in the Fuel category.
According to our investigation, the prices of all three products peaked in July 2012, then dropped rapidly and entered a steady upward trend that lasted until July of the following year.
daily_unit_price%>%
filter(item_id == "48973" | item_id=="49004" | item_id=="49009") %>%
ggplot( aes(x=date, y=daily_unit_price)) +
geom_line(color="#1A76FF") + # constant colour goes outside aes() so it is applied literally
facet_grid(item_id~. ) +
geom_smooth(color="#37536d", size=0.25)+
theme_bw()+ theme(legend.position = "none")+
labs(title = "Time Series Plot for Daily Unit Price")+
xlab("Date")+
ylab("Daily Unit Price")
Looking at the 4 categories on the local-brand side, we can confidently state that the highest average turnover belongs to the Pharmaceutical category, whereas the lowest pertains to local Grocery brands. On the established side, the Meat category appears to have the highest turnover once the Pharmaceutical outliers are ignored.
df<- sqldf("
SELECT
id.brand,
id.brand_type,
id.category,
SUM(ct.quantity) AS total_qty,
SUM(ct.selling_price)/SUM(quantity) as avg_turnover,
(SUM(-ct.other_discount)+SUM(-ct.coupon_discount))/SUM(quantity) as avg_total_discount
FROM
customer_transaction_data ct
LEFT JOIN item_data id USING(item_id)
GROUP BY brand,category,brand_type")
y <- list(title = "Average Turnover",
titlefont = F)
x <- list(title = "Brand Type",
titlefont = F)
df <- df %>% filter (category=='Grocery' | category =='Meat' | category=='Pharmaceutical' | category=='Seafood' )%>%filter(avg_total_discount>0)
fig <- plot_ly(df, x = ~brand_type, y = ~log(avg_turnover), color = ~category, type = "box")
fig <- fig %>% layout(boxmode = "group",xaxis=x,yaxis=y)
fig
There are two brand types, namely “Established” and “Local”. The four most interesting categories, Grocery, Meat, Pharmaceutical and Seafood, are filtered.
While the discount distributions of the categories do not differ significantly for established brands, larger gaps appear between the categories of local brands. Additionally, due to the limited amount of data for local Meat and Seafood brands, these appear with narrow variances in the graph.
Local seafood brands offer the largest discounts in the graph. This can be explained by the country's rich seafood resources, which lower costs in certain seasons. On the other hand, local meat brands could not offer discounts as large as those provided by established meat brands.
df<- sqldf("
SELECT
id.brand,
id.brand_type,
id.category,
SUM(ct.quantity) AS total_qty,
SUM(ct.selling_price)/SUM(quantity) as avg_selling_turnover,
(SUM(-ct.other_discount)+SUM(-ct.coupon_discount))/SUM(quantity) as avg_total_discount
FROM
customer_transaction_data ct
LEFT JOIN
item_data id USING(item_id)
GROUP BY
brand,category,brand_type ")
y <- list(title = "Average Discount",
titlefont = F)
x <- list(title = "Brand Type",
titlefont = F)
df <- df %>% filter (category=='Grocery' | category =='Meat' | category=='Pharmaceutical' | category=='Seafood' )%>%filter(avg_total_discount>0)
fig <- plot_ly(df, x = ~brand_type, y = ~log(avg_total_discount), color = ~category, type = "box")
fig <- fig %>% layout(boxmode = "group",xaxis=x,yaxis=y)
fig
When we examine the average order value that customers pay for the products they buy, the highest-paying group is 70+. In that group, married customers pay an average of 427 USD, while single customers pay an average of 630 USD. The group with the next highest payment is 56-70. We can therefore conclude that people are able to spend more around retirement age. In every age group except 70+, married people pay more; the reversal in the 70+ group may be explained by more people becoming single due to the death of a spouse.
final_train$age_range <- case_when(final_train$age_range == "0" ~"18-25",
final_train$age_range == "1" ~"26-35",
final_train$age_range == "2" ~"36-45",
final_train$age_range == "3" ~"46-55",
final_train$age_range == "4" ~"56-70",
TRUE ~"70+")
final_train$marital_status <- case_when(final_train$marital_status == "0"~"Married",TRUE ~ "Single")
# blank marital status values are imputed as "Married"
customer_demographics$marital_status <- case_when(customer_demographics$marital_status == ""~"Married",TRUE ~ customer_demographics$marital_status)
cus_inf<-sqldf(
"SELECT
cd.age_range,
cd.marital_status,
AVG(ft.price_sum/ft.customer_id_count) AS pprice_sum
FROM customer_demographics cd
LEFT JOIN
final_train ft ON ft.customer_id = cd.customer_id
GROUP BY
cd.age_range,cd.marital_status
ORDER BY
AVG(ft.price_sum) DESC")
mar <- cus_inf %>%
filter(marital_status == "Married")
sing <- cus_inf %>%
filter(marital_status == "Single")
married_psum <- round(mar$pprice_sum,2)
single_psum <- round(sing$pprice_sum,2)
age_rang <- mar$age_range
data <- data.frame(age_rang,married_psum,single_psum)
fig <- plot_ly(data, x = age_rang, y = married_psum, type = 'bar',name = 'Married',
marker = list(color = 'rgb(55, 83, 109)'))
fig <- fig %>% add_trace(y = ~single_psum, name = 'Single',marker = list(color = 'rgb(26, 118, 255)'))
fig <- fig %>% layout(title = 'Average Basket Value by Age and Marriage Status',
xaxis = list(
title = "Age Groups",
tickfont = list(
size = 14,
color = 'rgb(107, 107, 107)')),
yaxis = list(
title = 'Average Order Value in USD',
titlefont = list(
size = 16,
color = 'rgb(107, 107, 107)'),
tickfont = list(
size = 14,
color = 'rgb(107, 107, 107)')),
legend = list(x = 0, y = 0.9, bgcolor = 'rgba(255, 255, 255, 0)', bordercolor = 'rgba(245, 246, 249, 1)'),
barmode = 'group', bargap = 0.15, bargroupgap = 0.1)
fig
Analyzing campaigns and the customer base is crucial for the marketing department, because revenue generated by marketing campaigns has a large share in most companies’ financial statements. Accordingly, we performed a basic campaign analysis and customer segmentation on the given data set to support targeting.
The table below shows the return (redemption) rate for each Coupon ID. Coupon ID 586 has the highest return rate and is associated with Grocery products. As groceries are essential goods, it is understandable that this coupon tops the list.
kbl(sqldf("SELECT
coupon_id AS 'Coupon ID',
ROUND(AVG(redemption_status),3) AS 'Return Rate'
FROM final_train
GROUP BY coupon_id
ORDER BY 2 desc
LIMIT 10"))%>%
kable_classic(full_width = F, html_font = "Cambria")
| Coupon ID | Return Rate |
|---|---|
| 586 | 0.142 |
| 754 | 0.119 |
| 661 | 0.118 |
| 9 | 0.117 |
| 21 | 0.103 |
| 960 | 0.093 |
| 22 | 0.090 |
| 786 | 0.086 |
| 6 | 0.086 |
| 671 | 0.083 |
The code chunk below splits customers into two groups, those with an average order value above 500 USD and those below 500 USD, and shows the average coupon count for each group. A higher average basket value appears to go hand in hand with a higher coupon count.
kbl(sqldf("WITH basket AS(
SELECT customer_id,SUM(selling_price)/COUNT(DISTINCT Date) AS AOV
FROM customer_transaction_data
GROUP BY customer_id),
coupon_cnt AS(
SELECT
customer_id,
COUNT(coupon_id) AS Coupon_Count
FROM
final_train
GROUP BY
customer_id
)
SELECT 'Less than 500' AS Category,ROUND(AVG(b.Coupon_Count),0) AS 'Average Count'
FROM basket a
INNER JOIN coupon_cnt b ON a.customer_id=b.customer_id
WHERE a.AOV<500
UNION
SELECT 'Greater than 500' AS Category,ROUND(AVG(b.Coupon_Count),0) AS 'Average Count'
FROM basket a
INNER JOIN coupon_cnt b ON a.customer_id=b.customer_id
WHERE a.AOV>500"
))%>%
kable_classic(full_width = F, html_font = "Cambria")
| Category | Average Count |
|---|---|
| Greater than 500 | 56 |
| Less than 500 | 46 |
RFM analysis is a marketing technique that allows us to explore different customer profiles in our customer base. RFM stands for Recency, Frequency and Monetary. In this analysis, we added campaign usage (RFMC) to account for customers who are responsive to marketing campaigns. The first query calculates the RFMC values and assigns segments using the following logic: Recency is the number of days since the customer's last transaction, Frequency is the number of distinct purchase days, Monetary is taken as the average order value (total spend divided by frequency), and campaign usage is the number of coupon offers recorded for the customer in final_train. Each metric is then scored from 1 (worst) to 4 (best) by quartiles with ntile(4), and the four scores are concatenated into a four-digit segment code.
The query builds this up through common table expressions: RFM computes the per-customer Recency, Frequency and Monetary values, coupon_cnt counts coupon offers, aggregated joins the two and derives the AOV, and scored assigns the quartile scores.
The RFM_Table query then uses CASE WHEN expressions to map the segment codes to named segments for our customer base.
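The quartile scoring inside the SQL below can equivalently be sketched with dplyr's ntile(); the tiny data frame here is hypothetical, stand-in data for the per-customer aggregates, used only to illustrate the mechanics:

```r
library(dplyr)

# Hypothetical per-customer metrics (stand-in for the "aggregated" step)
rfm <- data.frame(
  Customer_ID  = 1:8,
  Recency      = c(429, 380, 200, 120, 60, 10, 5, 1),
  Frequency    = c(28, 11, 90, 120, 40, 75, 15, 150),
  AOV          = c(1071, 1638, 300, 450, 900, 250, 700, 520),
  Coupon_Count = c(14, 22, 5, 40, 9, 31, 2, 18)
)

scored <- rfm %>%
  mutate(
    # ordering by descending Recency means the stalest buyers score 1,
    # the most recent buyers score 4, matching "ntile(4) over (order by Recency desc)"
    Recency_Score   = ntile(desc(Recency), 4),
    Frequency_Score = ntile(Frequency, 4),
    Monetary_Score  = ntile(AOV, 4),
    Cmp_Count_Score = ntile(Coupon_Count, 4),
    Segment = paste0(Recency_Score, Frequency_Score,
                     Monetary_Score, Cmp_Count_Score)
  )
head(scored)
```

Each customer ends up with a four-character code such as "1131", which the CASE WHEN mapping below translates into a named segment.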
RFM<-sqldf(
"WITH RFM AS(
SELECT
customer_id,
((SELECT MAX(date) as max_dt FROM customer_transaction_data)-MAX(date)) AS Recency,
COUNT(DISTINCT date) AS Frequency,
SUM(selling_price) AS Monetary
FROM customer_transaction_data
GROUP BY customer_id
),
coupon_cnt AS(
SELECT
customer_id,
COUNT(coupon_id) AS Coupon_Count
FROM
final_train
GROUP BY
customer_id
),
aggregated AS(
SELECT
a.customer_id AS Customer_ID,
a.Recency,
a.Frequency,
ROUND(a.Monetary/a.Frequency,1) AS AOV,
b.Coupon_Count
FROM
RFM a
INNER JOIN
coupon_cnt b ON a.customer_id=b.customer_id
GROUP BY
a.customer_id),
scored AS(
SELECT
Customer_ID,
Recency,Frequency,AOV, Coupon_Count,
ntile(4) over (order by Recency desc) as Recency_Score,
ntile(4) over (order by Frequency) as Frequency_Score,
ntile(4) over (order by AOV) as Monetary_Score,
ntile(4) over (order by Coupon_Count) as Cmp_Count_Score
FROM
aggregated)
SELECT *, Recency_Score||Frequency_Score||Monetary_Score||Cmp_Count_Score AS Segment
FROM scored")
kbl(head(RFM))%>%
kable_classic(full_width = F, html_font = "Cambria")
| Customer_ID | Recency | Frequency | AOV | Coupon_Count | Recency_Score | Frequency_Score | Monetary_Score | Cmp_Count_Score | Segment |
|---|---|---|---|---|---|---|---|---|---|
| 1283 | 429 | 28 | 1071.5 | 14 | 1 | 1 | 3 | 1 | 1131 |
| 759 | 380 | 11 | 1638.7 | 22 | 1 | 1 | 4 | 1 | 1141 |
| 975 | 364 | 17 | 1183.6 | 18 | 1 | 1 | 3 | 1 | 1131 |
| 1132 | 346 | 30 | 905.8 | 19 | 1 | 1 | 2 | 1 | 1121 |
| 1112 | 265 | 72 | 949.2 | 21 | 1 | 2 | 2 | 1 | 1221 |
| 1261 | 233 | 30 | 1753.3 | 27 | 1 | 1 | 4 | 1 | 1141 |
RFM_Table<-sqldf(
"SELECT
*,
CASE
WHEN Segment IN ('4444','4344','4434','4433','4443','3444','3443','4442')
THEN 'Star'
WHEN Segment IN ('3442','3441','3431','3432','3443','3444','3433','3434','4431','4432','4433','4434',
'4341','4342','4343','4344','3333','3331','3332','3334','3341','3342','3343','3344','4331','4332','4333','4334')
THEN 'High Value'
WHEN Segment IN ('3411','3412','3413','3414','4411','4412','4413','4414','4422','4324',
'4421','4423','4424','4311','4312','4313','4314','3422','3421','3423','3424','4322','4321','4323','3311', '3312','3313','3314','3321','3323','3324','3322','3122','3121','3123','3124','3221','3223','3223','3224')
THEN 'Second Role'
WHEN Segment IN ('1331','1332','1431','1432','1433','1333','1334','1434','1341','1342',
'1343','1344','1441','1442','1443','1444','1244','1243','1242','1241')
THEN 'Churned Best'
WHEN Segment IN ('1113','1114','1213','1214','1123','1124','2113','2114','2213','2214',
'2123','2124','1223','1224','2224','2224')
THEN 'Responsives'
WHEN Segment IN ('4141','4142','4143','4144','4131','4132','4133','4134','4241','4242',
'4243','4244','4231','4232','4233','4234')
THEN 'High Value New'
WHEN Segment IN ('4111','4112','4113','4114','4121','4122','4123','4124','4211','4212',
'4213','4214','4221','4222','4223','4224')
THEN 'Low Value New'
WHEN Segment IN ('1111','1211','1121','1112','2111','2211','1221','1122','1222','2221',
'2122','2212','2112','2121','2222','1212','1131','1231','1132','1232','1141','1241','1142','1242','1133', '1131','1132','1134','1141','1142','1143','1144','1321','1322','1323','1324','1311','1312','1313','1314')
THEN 'Churned'
WHEN Segment IN ('2331','2332','2333','2334','2231','2232','2233','2234','2341','2342',
'2343','2344','2431','2432','2433','2434','3331','3332','3333','3334','3231','3232','3233','3234','2341', '3342','3343','3344','3431','3432','3433','3434','2441','2442','2443','2444','4241','4242','4243','4241', '4422','4423','4424','3241','3242','3243','3244')
THEN 'Promising'
ELSE 'Other'
END AS RFM_Segment
FROM RFM")
kbl(head(RFM_Table))%>%
kable_classic(full_width = F, html_font = "Cambria")
| Customer_ID | Recency | Frequency | AOV | Coupon_Count | Recency_Score | Frequency_Score | Monetary_Score | Cmp_Count_Score | Segment | RFM_Segment |
|---|---|---|---|---|---|---|---|---|---|---|
| 1283 | 429 | 28 | 1071.5 | 14 | 1 | 1 | 3 | 1 | 1131 | Churned |
| 759 | 380 | 11 | 1638.7 | 22 | 1 | 1 | 4 | 1 | 1141 | Churned |
| 975 | 364 | 17 | 1183.6 | 18 | 1 | 1 | 3 | 1 | 1131 | Churned |
| 1132 | 346 | 30 | 905.8 | 19 | 1 | 1 | 2 | 1 | 1121 | Churned |
| 1112 | 265 | 72 | 949.2 | 21 | 1 | 2 | 2 | 1 | 1221 | Churned |
| 1261 | 233 | 30 | 1753.3 | 27 | 1 | 1 | 4 | 1 | 1141 | Churned |
You may find the Segment definitions in the CASE WHEN mapping above.
The plot below shows the distribution of our segments. Churned customers are at the top; this may be driven by non-activated customers whom we could not bring back for a second order. We have a moderate-to-low number of High Value and Star customers, which is good, because we cannot afford to treat a large share of customers as champions.
p<-RFM_Table%>%select(Customer_ID,RFM_Segment)%>%
group_by(RFM_Segment)%>%
summarize(cnt=n())%>%
ggplot(aes(x=reorder(RFM_Segment, cnt), y=cnt)) +
geom_bar(stat="identity", fill="#37536d")+
theme_minimal()+
coord_flip()+
labs(title = "RFM Segment Distribution")+
xlab("Segment")+
ylab("Count")
ggplotly(p)
Below plots show us the distribution of RFMC metrics based on our RFM segments.
The Churned Best, Star and High Value segments have the highest median monetary values. This makes sense, but it also tells us that we should find a way to win back Churned Best customers immediately, as they have the highest median monetary value of all.
As expected, median recency is close to zero for the New segments and the Stars; Stars clearly purchase from our stores frequently.
Median frequency is highest for Stars and Second Roles. This is expected, because Second Role customers are loyal but have a tendency to churn: they can have high frequency while remaining low on the monetary side.
Campaign usage is highest in the Star and Responsive segments. This is expected for Responsives, since we created that segment from the coupon redemption data. Such customers need a trigger to purchase, but they may not be as profitable as other segments because they always wait for a campaign or a discount. If we had item costs, we could likely show that the net profit of the Responsive segment is much lower than that of the other segments.
ggarrange(RFM_Table%>%select(RFM_Segment,AOV)%>%
group_by(RFM_Segment)%>%
summarize(mdn=median(AOV))%>%
ggplot(aes(x=reorder(RFM_Segment, mdn), y=mdn)) +
geom_bar(stat="identity", fill="#37536d")+
theme_minimal()+
coord_flip()+ labs(title = "Median Monetary")+
xlab("Segment")+
ylab("Median"),
RFM_Table%>%select(RFM_Segment,Recency)%>%
group_by(RFM_Segment)%>%
summarize(mdn=median(Recency))%>%
ggplot(aes(x=reorder(RFM_Segment, mdn), y=mdn)) +
geom_bar(stat="identity", fill="#37536d")+
theme_minimal()+
coord_flip()+ labs(title = "Median Recency")+
xlab("Segment")+
ylab("Median"),
RFM_Table%>%select(RFM_Segment,Frequency)%>%
group_by(RFM_Segment)%>%
summarize(mdn=median(Frequency))%>%
ggplot(aes(x=reorder(RFM_Segment, mdn), y=mdn))+
geom_bar(stat="identity", fill="#37536d")+
theme_minimal()+
coord_flip()+ labs(title = "Median Frequency")+
xlab("Segment")+
ylab("Median"),
RFM_Table%>%select(RFM_Segment,Coupon_Count)%>%
group_by(RFM_Segment)%>%
summarize(mdn=median(Coupon_Count))%>%
ggplot(aes(x=reorder(RFM_Segment, mdn), y=mdn))+
geom_bar(stat="identity", fill="#37536d")+
theme_minimal() +
coord_flip()+ labs(title = "Median Coupon Count")+
xlab("Segment")+
ylab("Median"))